The purpose of my analysis is to determine which movies I will enjoy the most. I do this by building a model whose response variable is how much I liked a movie out of 100, predicted from data on various movie reviews, genres, cast and crew, and more. The goal is to choose movies to watch with confidence that I will enjoy them.
The data was collected over time by me. A partial dataset in JSON form is available on my website here: https://www.tradethisandthat.com/movies/api/all_movies/. The references include Python code I wrote to export the MySQL database to CSV. Data from TMDB was collected using their open API. Data from IMDB was collected by web scraping their awards page for any Oscars and their ratings page for rating distributions. Much of the data is gathered by Python code that runs in Django whenever a movie is added. My ratings and the Metacritic ratings are collected by me.
I tried fitting many different models, including SLRs, MLRs, polynomials, monotone transformations, ridge, lasso, and GAMs.
# begin with some simple models
fit1 = lm(my_rating~imdb_rating, data=continuous_movies)
summary(fit1)
##
## Call:
## lm(formula = my_rating ~ imdb_rating, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.975 -8.660 2.319 10.025 50.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.1699 4.4286 -8.167 1.96e-15 ***
## imdb_rating 12.9445 0.6136 21.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.13 on 584 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4315
## F-statistic: 445 on 1 and 584 DF, p-value: < 2.2e-16
fit2 = lm(my_rating~metacritic_rating, data=continuous_movies)
summary(fit2)
##
## Call:
## lm(formula = my_rating ~ metacritic_rating, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.37 -10.01 1.52 11.83 38.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.80834 2.91499 5.423 8.59e-08 ***
## metacritic_rating 0.61262 0.04279 14.316 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.14 on 584 degrees of freedom
## Multiple R-squared: 0.2598, Adjusted R-squared: 0.2585
## F-statistic: 205 on 1 and 584 DF, p-value: < 2.2e-16
fit3 = lm(my_rating~tmdb_rating, data=continuous_movies)
summary(fit3)
##
## Call:
## lm(formula = my_rating ~ tmdb_rating, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.043 -8.252 2.349 10.104 40.173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51.0368 5.7467 -8.881 <2e-16 ***
## tmdb_rating 15.1543 0.8057 18.808 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.8 on 584 degrees of freedom
## Multiple R-squared: 0.3772, Adjusted R-squared: 0.3762
## F-statistic: 353.7 on 1 and 584 DF, p-value: < 2.2e-16
plot(continuous_movies$my_rating, fit1$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit2$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit3$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models", bg="transparent",
legend=c("IMDB", "Metacritic", "TMDB"), # order matches fit1, fit2, fit3 and their colors
fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)
fit4 = lm(my_rating~imdb_rating+metacritic_rating+tmdb_rating,data=continuous_movies) # create an MLR with just movie ratings from other sources
summary(fit4)
##
## Call:
## lm(formula = my_rating ~ imdb_rating + metacritic_rating + tmdb_rating,
## data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.326 -8.704 2.258 9.844 49.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.48555 5.79723 -6.984 7.88e-12 ***
## imdb_rating 10.53253 1.55661 6.766 3.23e-11 ***
## metacritic_rating 0.03985 0.05674 0.702 0.483
## tmdb_rating 2.66900 1.78948 1.491 0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.12 on 582 degrees of freedom
## Multiple R-squared: 0.4352, Adjusted R-squared: 0.4323
## F-statistic: 149.5 on 3 and 582 DF, p-value: < 2.2e-16
fit5 = lm(my_rating~.-name,data=continuous_movies) # create an MLR with all linear terms
summary(fit5)
##
## Call:
## lm(formula = my_rating ~ . - name, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.881 -8.154 1.378 8.372 51.355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.418e+00 2.237e+01 0.287 0.77433
## imdb_rating 1.688e+00 6.637e+00 0.254 0.79931
## tmdb_rating -6.263e+00 5.267e+00 -1.189 0.23489
## tmdb_count -9.209e-05 3.765e-04 -0.245 0.80688
## metacritic_rating -1.867e-01 1.295e-01 -1.442 0.15002
## budget -1.747e-08 1.589e-08 -1.100 0.27205
## revenue -1.436e-09 2.927e-09 -0.490 0.62401
## runtime -3.679e-02 4.214e-02 -0.873 0.38311
## award_count -4.153e-01 3.042e-01 -1.365 0.17270
## imdb_count 3.693e-05 1.623e-05 2.276 0.02323 *
## imdb_arithmetic_mean -2.288e+00 3.410e+00 -0.671 0.50251
## imdb_median -1.787e-01 8.727e-01 -0.205 0.83786
## imdb_top_1000_rating 1.790e+00 1.875e+00 0.955 0.34008
## imdb_top_1000_count 1.889e-02 6.372e-03 2.965 0.00316 **
## imdb_us_rating 6.948e+00 4.351e+00 1.597 0.11095
## imdb_us_count -1.267e-04 5.932e-05 -2.136 0.03315 *
## imdb_not_us_rating 4.190e+00 4.777e+00 0.877 0.38082
## imdb_not_us_count -4.886e-05 3.790e-05 -1.289 0.19786
## mpaa_name 2.612e-01 6.276e-01 0.416 0.67745
## imdb_rating_percentile -1.564e-01 1.141e-01 -1.371 0.17095
## tmdb_rating_percentile 2.987e-01 1.475e-01 2.025 0.04334 *
## metacritic_rating_percentile 1.090e-01 6.191e-02 1.760 0.07899 .
## is_action 2.840e+00 1.642e+00 1.730 0.08420 .
## is_comedy 1.091e+00 1.580e+00 0.690 0.49025
## is_adventure -2.751e+00 1.534e+00 -1.793 0.07351 .
## is_animation 1.431e+00 2.656e+00 0.539 0.59023
## is_family 3.269e+00 2.376e+00 1.376 0.16945
## is_drama 4.133e-01 1.761e+00 0.235 0.81450
## is_scifi 1.322e+00 1.673e+00 0.790 0.42970
## is_thriller 8.779e-01 1.785e+00 0.492 0.62305
## is_brad_pitt 4.318e+00 4.034e+00 1.070 0.28491
## is_stan_lee 3.593e+00 3.048e+00 1.179 0.23899
## is_christopher_nolan 5.361e+00 6.398e+00 0.838 0.40253
## is_spielberg -1.090e+00 3.446e+00 -0.316 0.75200
## is_harrison_ford 3.944e+00 5.412e+00 0.729 0.46650
## is_matt_damon 5.681e+00 4.157e+00 1.367 0.17235
## is_wes_anderson 7.329e+00 6.880e+00 1.065 0.28724
## is_tom_cruise 1.933e+00 4.323e+00 0.447 0.65499
## is_john_williams 1.013e+01 4.516e+00 2.243 0.02531 *
## is_rdj -5.650e-01 4.649e+00 -0.122 0.90333
## is_quentin_tarantino -5.106e+00 5.865e+00 -0.870 0.38443
## is_tom_hanks 2.609e+00 4.119e+00 0.633 0.52670
## is_george_lucas 1.781e+00 5.621e+00 0.317 0.75146
## is_leonardo_dicaprio 1.055e+00 5.595e+00 0.188 0.85057
## is_the_rock 3.416e+00 3.962e+00 0.862 0.38900
## is_stanley_kubrick 1.951e+00 6.408e+00 0.304 0.76095
## is_john_hughes 2.951e+00 5.159e+00 0.572 0.56758
## is_jim_carrey 7.809e+00 5.203e+00 1.501 0.13396
## is_wally_pfister 5.665e+00 6.792e+00 0.834 0.40461
## is_henry_fonda 1.756e+01 1.366e+01 1.286 0.19899
## is_morgan_freeman -2.065e+00 4.808e+00 -0.430 0.66770
## is_bong_joon_ho 5.757e+00 8.088e+00 0.712 0.47690
## is_dustin_hoffman 9.500e+00 5.612e+00 1.693 0.09108 .
## is_arnold_schwarz -5.685e+00 8.090e+00 -0.703 0.48252
## is_jack_nicholson -1.476e+01 8.111e+00 -1.820 0.06928 .
## is_aamir_khan 2.371e+01 1.403e+01 1.691 0.09151 .
## is_sean_connery 1.366e+01 8.103e+00 1.685 0.09250 .
## is_brad_bird 7.857e+00 6.328e+00 1.242 0.21494
## is_natalie_portman 2.629e+00 5.006e+00 0.525 0.59976
## is_robin_williams 1.037e+01 5.239e+00 1.980 0.04826 *
## is_sandra_bullock 3.728e+00 5.572e+00 0.669 0.50377
## is_bill_murray 5.583e+00 4.935e+00 1.131 0.25844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.25 on 524 degrees of freedom
## Multiple R-squared: 0.5526, Adjusted R-squared: 0.5005
## F-statistic: 10.61 on 61 and 524 DF, p-value: < 2.2e-16
fit6 = lm(my_rating~imdb_count+imdb_top_1000_count+imdb_us_count+imdb_not_us_count,data=continuous_movies)
summary(fit6) # create an MLR with just movie rating counts
##
## Call:
## lm(formula = my_rating ~ imdb_count + imdb_top_1000_count + imdb_us_count +
## imdb_not_us_count, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.720 -9.889 1.807 11.038 43.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.725e+01 2.108e+00 17.675 < 2e-16 ***
## imdb_count 4.757e-05 1.162e-05 4.095 4.82e-05 ***
## imdb_top_1000_count 2.468e-02 4.845e-03 5.093 4.76e-07 ***
## imdb_us_count -1.000e-05 4.480e-05 -0.223 0.8234
## imdb_not_us_count -8.532e-05 3.331e-05 -2.561 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 581 degrees of freedom
## Multiple R-squared: 0.2944, Adjusted R-squared: 0.2895
## F-statistic: 60.59 on 4 and 581 DF, p-value: < 2.2e-16
plot(continuous_movies$my_rating, fit4$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit5$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit6$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models", bg="transparent",
legend=c("IMDB+TMDB+Metacritic", "All",'All Counts'),
fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)
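The adjusted R-squared and F-statistic lines in the summaries above follow directly from the multiple R-squared, the number of observations, and the number of predictors. The sketch below is a plain Python check of that arithmetic and is not part of the original R analysis. It assumes n = 586 (residual degrees of freedom plus estimated coefficients), and the function names are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a linear model with p predictors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def overall_f(r2, n, p):
    """Overall F statistic for the test that all p slopes are zero."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


n = 586  # e.g. 582 residual df + 4 coefficients in fit4

# fit4: three rating predictors, Multiple R-squared 0.4352
fit4_adj = adjusted_r2(0.4352, n, 3)
fit4_f = overall_f(0.4352, n, 3)

# fit5: 61 predictors, Multiple R-squared 0.5526
fit5_adj = adjusted_r2(0.5526, n, 61)

print(round(fit4_adj, 4), round(fit4_f, 1), round(fit5_adj, 4))
```

The much larger gap between R-squared and adjusted R-squared for fit5 than for fit4 reflects the penalty for carrying 61 predictors.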
## Transformed and Polynomial MLRs
fit7 = lm(sqrt(my_rating)~.-name,data=continuous_movies) # MLR with full model to sqrt of my rating
summary(fit7)
##
## Call:
## lm(formula = sqrt(my_rating) ~ . - name, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8407 -0.5095 0.1454 0.6056 3.9915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.353e+00 1.695e+00 1.978 0.04841 *
## imdb_rating 5.560e-01 5.028e-01 1.106 0.26928
## tmdb_rating -7.435e-01 3.990e-01 -1.864 0.06294 .
## tmdb_count -1.024e-05 2.853e-05 -0.359 0.71973
## metacritic_rating -1.458e-02 9.813e-03 -1.486 0.13792
## budget -1.186e-09 1.204e-09 -0.986 0.32482
## revenue -4.278e-11 2.218e-10 -0.193 0.84709
## runtime -2.128e-03 3.192e-03 -0.667 0.50535
## award_count -3.046e-02 2.304e-02 -1.322 0.18675
## imdb_count 2.406e-06 1.229e-06 1.958 0.05078 .
## imdb_arithmetic_mean -2.659e-01 2.583e-01 -1.029 0.30379
## imdb_median -4.245e-03 6.611e-02 -0.064 0.94883
## imdb_top_1000_rating 1.327e-01 1.420e-01 0.935 0.35045
## imdb_top_1000_count 1.531e-03 4.827e-04 3.171 0.00161 **
## imdb_us_rating 5.424e-01 3.296e-01 1.645 0.10047
## imdb_us_count -9.192e-06 4.493e-06 -2.046 0.04128 *
## imdb_not_us_rating 2.895e-01 3.619e-01 0.800 0.42405
## imdb_not_us_count -3.273e-06 2.871e-06 -1.140 0.25473
## mpaa_name 1.894e-02 4.754e-02 0.398 0.69046
## imdb_rating_percentile -2.235e-02 8.641e-03 -2.587 0.00995 **
## tmdb_rating_percentile 2.926e-02 1.117e-02 2.618 0.00909 **
## metacritic_rating_percentile 7.776e-03 4.690e-03 1.658 0.09792 .
## is_action 2.064e-01 1.244e-01 1.659 0.09765 .
## is_comedy 5.186e-02 1.197e-01 0.433 0.66505
## is_adventure -2.185e-01 1.162e-01 -1.880 0.06067 .
## is_animation 1.211e-01 2.012e-01 0.602 0.54760
## is_family 2.445e-01 1.800e-01 1.358 0.17493
## is_drama 1.508e-02 1.334e-01 0.113 0.91000
## is_scifi 9.787e-02 1.267e-01 0.772 0.44025
## is_thriller 7.312e-02 1.352e-01 0.541 0.58890
## is_brad_pitt 3.188e-01 3.056e-01 1.043 0.29732
## is_stan_lee 2.609e-01 2.309e-01 1.130 0.25900
## is_christopher_nolan 4.147e-01 4.847e-01 0.856 0.39259
## is_spielberg -8.074e-02 2.610e-01 -0.309 0.75723
## is_harrison_ford 2.389e-01 4.100e-01 0.583 0.56025
## is_matt_damon 3.650e-01 3.149e-01 1.159 0.24705
## is_wes_anderson 5.662e-01 5.212e-01 1.086 0.27779
## is_tom_cruise 1.565e-01 3.275e-01 0.478 0.63299
## is_john_williams 6.487e-01 3.421e-01 1.896 0.05845 .
## is_rdj -1.184e-02 3.522e-01 -0.034 0.97321
## is_quentin_tarantino -3.294e-01 4.443e-01 -0.741 0.45875
## is_tom_hanks 1.787e-01 3.120e-01 0.573 0.56716
## is_george_lucas 1.235e-01 4.258e-01 0.290 0.77186
## is_leonardo_dicaprio 1.099e-01 4.239e-01 0.259 0.79548
## is_the_rock 2.350e-01 3.002e-01 0.783 0.43400
## is_stanley_kubrick 1.301e-01 4.854e-01 0.268 0.78877
## is_john_hughes 2.323e-01 3.908e-01 0.594 0.55251
## is_jim_carrey 5.674e-01 3.941e-01 1.440 0.15058
## is_wally_pfister 3.609e-01 5.145e-01 0.701 0.48333
## is_henry_fonda 7.873e-01 1.034e+00 0.761 0.44697
## is_morgan_freeman -2.518e-01 3.642e-01 -0.691 0.48968
## is_bong_joon_ho 3.681e-01 6.127e-01 0.601 0.54825
## is_dustin_hoffman 6.196e-01 4.251e-01 1.457 0.14561
## is_arnold_schwarz -4.388e-01 6.128e-01 -0.716 0.47431
## is_jack_nicholson -9.427e-01 6.144e-01 -1.534 0.12556
## is_aamir_khan 1.469e+00 1.063e+00 1.382 0.16749
## is_sean_connery 8.199e-01 6.138e-01 1.336 0.18221
## is_brad_bird 5.077e-01 4.794e-01 1.059 0.29003
## is_natalie_portman 1.896e-01 3.792e-01 0.500 0.61731
## is_robin_williams 7.056e-01 3.968e-01 1.778 0.07599 .
## is_sandra_bullock 2.445e-01 4.221e-01 0.579 0.56269
## is_bill_murray 2.808e-01 3.739e-01 0.751 0.45292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.003 on 524 degrees of freedom
## Multiple R-squared: 0.5311, Adjusted R-squared: 0.4766
## F-statistic: 9.731 on 61 and 524 DF, p-value: < 2.2e-16
fit8 = lm(my_rating~log(imdb_count)+imdb_rating,data=continuous_movies) # simple model with my_rating to imdb_rating and log(imdb_count)
summary(fit8)
##
## Call:
## lm(formula = my_rating ~ log(imdb_count) + imdb_rating, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.555 -8.591 1.701 9.537 51.826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -55.8570 5.3204 -10.499 < 2e-16 ***
## log(imdb_count) 2.7418 0.4381 6.258 7.57e-10 ***
## imdb_rating 11.0089 0.6702 16.427 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.69 on 583 degrees of freedom
## Multiple R-squared: 0.4682, Adjusted R-squared: 0.4663
## F-statistic: 256.6 on 2 and 583 DF, p-value: < 2.2e-16
fit9 = lm(my_rating~imdb_rating+I(imdb_rating^2)+I(imdb_rating^3),data=continuous_movies) # fitting a cubic model with imdb_rating
summary(fit9)
##
## Call:
## lm(formula = my_rating ~ imdb_rating + I(imdb_rating^2) + I(imdb_rating^3),
## data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.486 -8.483 2.207 9.741 48.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.9929 71.1087 0.689 0.491
## imdb_rating -24.2463 33.8272 -0.717 0.474
## I(imdb_rating^2) 5.2690 5.2629 1.001 0.317
## I(imdb_rating^3) -0.2430 0.2679 -0.907 0.365
##
## Residual standard error: 14.13 on 582 degrees of freedom
## Multiple R-squared: 0.4347, Adjusted R-squared: 0.4318
## F-statistic: 149.2 on 3 and 582 DF, p-value: < 2.2e-16
fit10 = lm(sqrt(my_rating)~sqrt(imdb_count)+sqrt(imdb_rating), data=continuous_movies) # fitting with sqrts
summary(fit10)
##
## Call:
## lm(formula = sqrt(my_rating) ~ sqrt(imdb_count) + sqrt(imdb_rating),
## data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9775 -0.5516 0.1587 0.6845 3.7046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8029151 0.7059644 -5.387 1.04e-07 ***
## sqrt(imdb_count) 0.0009896 0.0001805 5.484 6.20e-08 ***
## sqrt(imdb_rating) 3.9885846 0.2842290 14.033 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 583 degrees of freedom
## Multiple R-squared: 0.4544, Adjusted R-squared: 0.4525
## F-statistic: 242.8 on 2 and 583 DF, p-value: < 2.2e-16
fit11 = lm(my_rating ~ polym(imdb_rating, imdb_count,award_count, degree=5, raw=TRUE),data=continuous_movies) # fitting with a very big polynomial
# summary(fit11)
plot(continuous_movies$my_rating, fit7$fitted.values^2,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit8$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit9$fitted.values,col="firebrick4")
points(continuous_movies$my_rating,fit10$fitted.values^2,col="navajowhite")
points(continuous_movies$my_rating,fit11$fitted.values,col="royalblue") # fit11 models my_rating directly, so no back-transform
legend(x = "topleft", title="Models", bg="transparent",
legend=c("SQRT~All", "IMDB+log(imdb_count)",'IMDB^3','SQRT~SQRT','polym'),
fill = c("orchid","aquamarine3",'firebrick4','navajowhite','royalblue'))
abline(0,1)
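Models fit to sqrt(my_rating) predict on the square-root scale, which is why their fitted values are squared before being plotted against my rating. The Python sketch below shows that back-transformation using the coefficients printed in the fit10 summary. The input movie (IMDB rating 8.0 with 250,000 votes) is a made-up example, not a row from the data.

```python
import math

# Coefficients copied from the fit10 summary (sqrt response, sqrt predictors).
INTERCEPT = -3.8029151
B_SQRT_COUNT = 0.0009896
B_SQRT_RATING = 3.9885846


def predict_fit10(imdb_rating, imdb_count):
    """Return the fit10 prediction on the original 0-100 rating scale."""
    sqrt_scale = (INTERCEPT
                  + B_SQRT_COUNT * math.sqrt(imdb_count)
                  + B_SQRT_RATING * math.sqrt(imdb_rating))
    # The model predicts sqrt(my_rating); square to undo the transformation.
    return sqrt_scale ** 2


# Hypothetical movie, for illustration only.
example = predict_fit10(imdb_rating=8.0, imdb_count=250_000)
print(round(example, 1))
```

Squaring the prediction is the simplest back-transformation. It tends to sit slightly below the mean on the original scale, because the residual variance on the square-root scale is ignored.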
Create data matrix to be used for both ridge and lasso.
library(glmnet) # provides glmnet() and cv.glmnet()
X = model.matrix(~ -1 + sqrt(imdb_count) + imdb_rating + sqrt(award_count), data = continuous_movies) # model to use for ridge and lasso fits
y = continuous_movies$my_rating
fit_lm = lm(my_rating ~ sqrt(imdb_count)+ imdb_rating + sqrt(award_count), data = continuous_movies) # non-ridge and lasso to compare to
fit_ridge = glmnet(X,y,alpha=0) # ridge fit
fit.cv.ridge = cv.glmnet(X,y,alpha=0)
plot(fit.cv.ridge)
fit_lasso = glmnet(X,y,alpha=1) # lasso fit
fit.cv.lasso = cv.glmnet(X,y,alpha=1)
plot(fit.cv.lasso)
beta_hat_mlr = coef(fit_lm)
beta_hat_ridge = coef(fit.cv.ridge, s = "lambda.1se")
beta_hat_lasso = coef(fit.cv.lasso, s = "lambda.1se")
cbind(beta_hat_mlr, beta_hat_ridge, beta_hat_lasso)
## 4 x 3 sparse Matrix of class "dgCMatrix"
## beta_hat_mlr s1 s1
## (Intercept) -24.83846628 3.53534323 -8.363625498
## sqrt(imdb_count) 0.01507946 0.01297789 0.009043295
## imdb_rating 10.20844225 6.25614967 8.363468039
## sqrt(award_count) -0.05764352 1.50753433 .
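The `.` in the lasso column is an exact zero: the L1 penalty can remove a predictor entirely, while ridge only shrinks it. The sketch below shows the soft-thresholding operator behind that behavior next to ridge-style proportional shrinkage, for a single coefficient in the simplified orthonormal-predictor case. The input values and the penalty of 1.0 are arbitrary illustrations and are not taken from the fits above.

```python
def soft_threshold(z, penalty):
    """Lasso-style update: shrink z toward zero and clip to exactly zero."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0


def ridge_shrink(z, penalty):
    """Ridge-style update: scale z toward zero without ever reaching it."""
    return z / (1.0 + penalty)


# Arbitrary unpenalized coefficients, for illustration only.
for z in (5.0, 0.4, -2.5):
    print(z, soft_threshold(z, 1.0), ridge_shrink(z, 1.0))
```

A coefficient smaller in magnitude than the penalty is set to zero by the lasso update but only halved by the ridge update. This mirrors sqrt(award_count) being dropped by lasso and kept by ridge.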
plot(continuous_movies$my_rating,predict(fit_ridge,X,s=fit.cv.ridge$lambda.1se),xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid") # reuse the ridge fit from above
points(continuous_movies$my_rating,predict(fit_lasso,X,s=fit.cv.lasso$lambda.1se),col="aquamarine3") # reuse the lasso fit from above
points(continuous_movies$my_rating,fit_lm$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models",
legend=c("Ridge", "Lasso", "MLR"),
fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)
library(leaps) # procedure from lab
fit_full = lm(my_rating~.-name+log(imdb_count)+log(tmdb_count)+log(imdb_us_count)+log(imdb_not_us_count), data=continuous_movies) # start by creating a full and add some log terms for values not bounded to a specific range
fit_null = lm(my_rating~1, data=continuous_movies) # create a null fit to just an intercept
anova(fit_null,fit_full,test='F') # make sure that the full fit is more significant than the null fit
Because the F-test shows that the full model is a significant improvement over the null model, it makes sense to continue with AIC and BIC to select a model. AIC and BIC penalize each added parameter, which is necessary because adding any parameter never makes the in-sample fit worse.
fit_aic = step(fit_null,list(upper=fit_full),direction='forward') # run AIC
n = nrow(continuous_movies)
fit_bic = step(fit_null,list(upper=fit_full),direction='forward',k=log(n)) # run BIC
summary(fit_aic)
##
## Call:
## lm(formula = my_rating ~ imdb_rating + log(imdb_count) + is_john_williams +
## tmdb_rating_percentile + is_robin_williams + is_bill_murray +
## is_sean_connery + is_jack_nicholson + is_christopher_nolan +
## is_stan_lee + is_comedy + is_matt_damon + imdb_us_rating +
## is_dustin_hoffman + is_action + is_adventure + is_family +
## is_henry_fonda + is_aamir_khan + imdb_not_us_rating, data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.208 -8.048 0.963 8.901 51.492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.54690 8.02058 -4.557 6.37e-06 ***
## imdb_rating -5.03237 5.51677 -0.912 0.362055
## log(imdb_count) 2.18695 0.45396 4.817 1.87e-06 ***
## is_john_williams 11.04011 2.89969 3.807 0.000156 ***
## tmdb_rating_percentile 0.11880 0.04087 2.907 0.003793 **
## is_robin_williams 10.74979 5.07432 2.118 0.034571 *
## is_bill_murray 10.04111 3.87296 2.593 0.009771 **
## is_sean_connery 16.23032 7.74485 2.096 0.036560 *
## is_jack_nicholson -14.80868 7.67276 -1.930 0.054103 .
## is_christopher_nolan 9.18126 4.34181 2.115 0.034900 *
## is_stan_lee 4.44786 2.38016 1.869 0.062179 .
## is_comedy 2.03353 1.33940 1.518 0.129514
## is_matt_damon 7.14619 3.89592 1.834 0.067139 .
## imdb_us_rating 7.19509 3.22451 2.231 0.026048 *
## is_dustin_hoffman 7.94873 5.42923 1.464 0.143733
## is_action 3.48623 1.43060 2.437 0.015122 *
## is_adventure -3.04652 1.35713 -2.245 0.025166 *
## is_family 3.13360 1.53023 2.048 0.041041 *
## is_henry_fonda 19.83960 13.23803 1.499 0.134515
## is_aamir_khan 20.41390 13.39043 1.525 0.127940
## imdb_not_us_rating 5.85220 3.93296 1.488 0.137312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 565 degrees of freedom
## Multiple R-squared: 0.5249, Adjusted R-squared: 0.5081
## F-statistic: 31.21 on 20 and 565 DF, p-value: < 2.2e-16
summary(fit_bic)
##
## Call:
## lm(formula = my_rating ~ imdb_rating + log(imdb_count) + is_john_williams,
## data = continuous_movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.988 -8.279 1.523 9.610 51.665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -53.7527 5.3313 -10.083 < 2e-16 ***
## imdb_rating 10.9580 0.6659 16.456 < 2e-16 ***
## log(imdb_count) 2.5708 0.4389 5.857 7.89e-09 ***
## is_john_williams 8.5850 2.8724 2.989 0.00292 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.6 on 582 degrees of freedom
## Multiple R-squared: 0.4762, Adjusted R-squared: 0.4735
## F-statistic: 176.4 on 3 and 582 DF, p-value: < 2.2e-16
AIC(fit_bic)
## [1] 4727.974
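The AIC value above can be approximately reproduced by hand from the fit_bic summary, which also makes the difference between the AIC and BIC penalties concrete. This is a sketch assuming the usual Gaussian log-likelihood for a least-squares fit, with one extra parameter counted for the error variance. Because the residual standard error is printed rounded to 13.6, the result only matches to within rounding.

```python
import math


def gaussian_ic(rss, n, n_coef):
    """Return (AIC, BIC) for a least-squares fit with Gaussian errors.

    n_coef counts regression coefficients including the intercept; one extra
    parameter is added for the error variance.
    """
    k = n_coef + 1
    neg2_loglik = n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return neg2_loglik + 2 * k, neg2_loglik + math.log(n) * k


# fit_bic: residual standard error 13.6 on 582 df, 4 coefficients, n = 586.
rss = 13.6 ** 2 * 582
aic, bic = gaussian_ic(rss, n=586, n_coef=4)
print(round(aic, 1), round(bic, 1))
```

With n = 586, each parameter costs log(586), about 6.4, under BIC instead of 2 under AIC. That is why the BIC search stops at three predictors while the AIC search keeps twenty.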
plot(continuous_movies$my_rating, fit_aic$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit_bic$fitted.values,col="aquamarine3")
legend(x = "topleft", title="Models",
legend=c("AIC", "BIC"),
fill = c("orchid","aquamarine3"))
abline(0,1)
library(mgcv)
fit_gam1 = gam(my_rating~s(imdb_rating)+s(metacritic_rating)+s(tmdb_rating)+s(imdb_count)+s(award_count)+s(runtime),data=continuous_movies)
summary(fit_gam1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## my_rating ~ s(imdb_rating) + s(metacritic_rating) + s(tmdb_rating) +
## s(imdb_count) + s(award_count) + s(runtime)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.4343 0.5606 100.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(imdb_rating) 1.000 1.000 24.029 1.23e-06 ***
## s(metacritic_rating) 3.051 3.893 1.081 0.318
## s(tmdb_rating) 2.196 2.822 1.156 0.332
## s(imdb_count) 3.066 3.836 8.186 3.31e-06 ***
## s(award_count) 1.621 2.015 0.613 0.535
## s(runtime) 2.471 3.199 1.637 0.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.476 Deviance explained = 48.8%
## GCV = 188.82 Scale est. = 184.17 n = 586
fit_gam2 = gam(my_rating~s(imdb_rating)+s(metacritic_rating)+s(tmdb_rating)+s(sqrt(imdb_count))+s(sqrt(award_count))+s(runtime)+s(budget)+s(imdb_arithmetic_mean),data=continuous_movies)
summary(fit_gam2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## my_rating ~ s(imdb_rating) + s(metacritic_rating) + s(tmdb_rating) +
## s(sqrt(imdb_count)) + s(sqrt(award_count)) + s(runtime) +
## s(budget) + s(imdb_arithmetic_mean)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.4343 0.5538 101.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(imdb_rating) 1.000 1.000 10.144 0.00153 **
## s(metacritic_rating) 3.146 4.013 1.241 0.29384
## s(tmdb_rating) 2.612 3.342 1.489 0.18652
## s(sqrt(imdb_count)) 1.000 1.000 19.511 1.19e-05 ***
## s(sqrt(award_count)) 2.597 3.143 0.883 0.45300
## s(runtime) 2.551 3.302 1.410 0.26940
## s(budget) 4.788 5.851 1.875 0.09210 .
## s(imdb_arithmetic_mean) 1.000 1.000 0.546 0.46017
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.488 Deviance explained = 50.5%
## GCV = 186 Scale est. = 179.74 n = 586
fit_gam3 = gam(my_rating~imdb_rating*is_animation+imdb_rating*is_family + imdb_rating*is_adventure + imdb_rating*is_action+imdb_rating*is_drama+imdb_rating*is_comedy+imdb_rating*is_thriller+imdb_rating*is_scifi+s(imdb_count)+imdb_rating*mpaa_name+revenue*is_animation+revenue*is_action+revenue*is_drama+revenue*is_comedy+revenue*is_thriller+revenue*is_scifi,data=continuous_movies)
summary(fit_gam3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## my_rating ~ imdb_rating * is_animation + imdb_rating * is_family +
## imdb_rating * is_adventure + imdb_rating * is_action + imdb_rating *
## is_drama + imdb_rating * is_comedy + imdb_rating * is_thriller +
## imdb_rating * is_scifi + s(imdb_count) + imdb_rating * mpaa_name +
## revenue * is_animation + revenue * is_action + revenue *
## is_drama + revenue * is_comedy + revenue * is_thriller +
## revenue * is_scifi
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.705e-02 1.032e-03 45.577 < 2e-16 ***
## imdb_rating 3.437e-01 7.540e-03 45.577 < 2e-16 ***
## is_animation 4.759e-03 1.044e-04 45.577 < 2e-16 ***
## is_family 6.716e-03 1.474e-04 45.577 < 2e-16 ***
## is_adventure 9.811e-03 2.153e-04 45.577 < 2e-16 ***
## is_action 1.447e-02 3.176e-04 45.577 < 2e-16 ***
## is_drama 1.966e-02 4.313e-04 45.577 < 2e-16 ***
## is_comedy 1.836e-02 4.028e-04 45.577 < 2e-16 ***
## is_thriller 1.051e-02 2.307e-04 45.577 < 2e-16 ***
## is_scifi 1.010e-02 2.216e-04 45.577 < 2e-16 ***
## mpaa_name 2.136e-01 4.686e-03 45.577 < 2e-16 ***
## revenue 2.050e-08 6.325e-09 3.241 0.00126 **
## imdb_rating:is_animation 3.333e-02 7.314e-04 45.577 < 2e-16 ***
## imdb_rating:is_family 4.502e-02 9.878e-04 45.577 < 2e-16 ***
## imdb_rating:is_adventure 6.712e-02 1.473e-03 45.577 < 2e-16 ***
## imdb_rating:is_action 1.017e-01 2.232e-03 45.577 < 2e-16 ***
## imdb_rating:is_drama 1.503e-01 3.297e-03 45.577 < 2e-16 ***
## imdb_rating:is_comedy 1.265e-01 2.776e-03 45.577 < 2e-16 ***
## imdb_rating:is_thriller 7.726e-02 1.695e-03 45.577 < 2e-16 ***
## imdb_rating:is_scifi 7.084e-02 1.554e-03 45.577 < 2e-16 ***
## imdb_rating:mpaa_name 1.561e+00 3.426e-02 45.577 < 2e-16 ***
## is_animation:revenue 1.662e-08 6.052e-09 2.745 0.00623 **
## is_action:revenue -9.210e-09 5.256e-09 -1.752 0.08023 .
## is_drama:revenue 7.182e-09 5.984e-09 1.200 0.23056
## is_comedy:revenue -2.528e-09 5.225e-09 -0.484 0.62866
## is_thriller:revenue -5.431e-09 5.659e-09 -0.960 0.33764
## is_scifi:revenue 6.822e-10 4.677e-09 0.146 0.88408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(imdb_count) 6.391e-05 6.391e-05 32503749 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 8/36
## R-sq.(adj) = -0.067 Deviance explained = -7.02%
## GCV = 385.77 Scale est. = 380.5 n = 586
plot(continuous_movies$my_rating, fit_gam1$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit_gam2$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit_gam3$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models",
legend=c("GAM 1", "GAM 2", "GAM 3"),
fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)
plot(fit_gam1)
plot(fit_gam2)
plot(fit_gam3)
The fit plots have been omitted to save space. For the first GAM: IMDB rating and Metacritic rating are approximately linear. TMDB rating is linear until the rating reaches about 6, then increases in slope between 6.5 and 7.5 before flattening. This means that an increase in TMDB rating from 7.5 to 7.6 is expected to have a larger impact on my rating than an increase from 6.4 to 6.5. IMDB count has a very steep slope until it reaches around 500,000 and then flattens out. This effect is handled by using a log transformation. Finally, the effect of runtime actually flips. When runtime is under an hour, an increase in runtime leads to an expected increase in my rating. However, as runtime passes 150 minutes, a higher runtime means an expected decrease in my rating.
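With a log transformation, equal proportional changes in IMDB count have equal expected effects, which matches the steep-then-flat shape described for the raw count. The sketch below uses the log(imdb_count) coefficient from the fit8 summary to show the expected change in my rating when the count is multiplied by a given factor, holding IMDB rating fixed. The function name is illustrative.

```python
import math

B_LOG_COUNT = 2.7418  # coefficient on log(imdb_count) from the fit8 summary


def rating_change(count_ratio):
    """Expected change in my_rating under fit8 when imdb_count is multiplied
    by count_ratio and imdb_rating is held fixed (natural log, as in R)."""
    return B_LOG_COUNT * math.log(count_ratio)


doubling = rating_change(2.0)
tenfold = rating_change(10.0)
print(round(doubling, 2), round(tenfold, 2))
```

Under this model, going from 10,000 to 20,000 votes has the same expected effect as going from 500,000 to 1,000,000 votes.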
For the second GAM: all terms are approximately linear except for TMDB rating, runtime, and budget. TMDB rating and runtime have already been discussed. The IMDB arithmetic mean is a new addition to this model and actually has a negative slope. This highlights the slight difference between how the public IMDB rating is calculated and the arithmetic mean. The public IMDB rating is a weighted average, calculated using a secret formula to prevent review bombing. This suggests that the weighted average used by IMDB is a more powerful predictor than the arithmetic mean.
For budget, there is a quick upward slope at the beginning, suggesting that a percentage increase in budget matters more than a dollar-amount increase. However, this should be taken with a grain of salt, as there are flaws in using budget as a predictor. First, the movies were released at many different times and budgets are not adjusted for inflation, which harms the predictive power of budget. Moreover, some movies do not have public budget information, and whether it is available is related to their popularity. The mean IMDB count for movies with a nonzero budget is about 431,549 (4.3154913 × 10^5), while for movies with a budget of 0 the mean count is about 38,796 (3.8796132 × 10^4), more than 10 times smaller.
The main conclusions of my analysis
from django.core.management.base import BaseCommand
from movies.models import Credit, Movie, Person, Place, Viewing, WatchInfo
import csv, os


class Command(BaseCommand):
    help = 'export the full database to csv files'

    def handle(self, *args, **options):
        # csv files are written to a csv/ folder next to this command file
        FOLDER = os.path.dirname(os.path.abspath(__file__))
        # movies: one row per movie, with related ids and names flattened into lists
        with open(os.path.join(FOLDER, './csv/movie.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'franchise_id', 'mpaa_id', 'imdb_rating', 'metacritic_rating', 'my_rating', 'tmdb_id',
                 'imdb_id', 'name', 'tmdb_rating', 'tmdb_count', 'poster', 'runtime', 'release_date', 'recent_watch',
                 'viewing_count', 'release_day', 'release_month', 'release_year', 'revenue', 'budget',
                 'distance_from_rating_average', 'my_rating_percentile', 'imdb_rating_percentile',
                 'tmdb_rating_percentile', 'metacritic_rating_percentile', 'genre_ids', 'award_ids', 'award_count',
                 'production_company_ids', 'franchise_name', 'mpaa_name', 'genres_names', 'genre_numbers',
                 'award_names', 'production_company_names', 'imdb_count', 'imdb_arithmetic_mean', 'imdb_median',
                 'imdb_top_1000_rating', 'imdb_top_1000_count', 'imdb_us_rating', 'imdb_us_count', 'imdb_not_us_rating',
                 'imdb_not_us_count']
            )
            # map each genre to a stable integer index for the genre_numbers column
            genres = WatchInfo.objects.filter(watch_info_type='Genre')
            genre_dict = {genre: i for i, genre in enumerate(genres)}
            for movie in Movie.objects.all():
                filewriter.writerow(
                    [movie.uuid, (movie.franchise_id if movie.franchise is not None else ''),
                     (movie.mpaa_rating_id if movie.mpaa_rating is not None else ''), movie.imdb_rating,
                     movie.metacritic_rating, movie.my_rating_field, movie.tmdb_id, movie.imdb_id, movie.name,
                     movie.tmdb_rating, movie.tmdb_count, movie.poster, movie.runtime, movie.release_date,
                     movie.recent_watch, movie.viewing_count, movie.release_day.name, movie.release_month.name,
                     movie.release_year.name, movie.revenue, movie.budget, movie.distance_from_rating_average,
                     movie.rating_percentile, movie.imdb_rating_percentile, movie.tmdb_rating_percentile,
                     movie.metacritic_rating_percentile,
                     [x.pk for x in movie.genres.all()], [x.pk for x in movie.awards.all()], movie.award_count,
                     [x.pk for x in movie.production_companies.all()],
                     (movie.franchise.name if movie.franchise is not None else ''),
                     (movie.mpaa_rating.name if movie.mpaa_rating is not None else ''),
                     [x.name for x in movie.genres.all()], [genre_dict[x] for x in movie.genres.all()],
                     [x.name for x in movie.awards.all()],
                     [x.name for x in movie.production_companies.all()], movie.imdb_count, movie.imdb_arithmetic_mean,
                     movie.imdb_median, movie.imdb_top_1000_rating, movie.imdb_top_1000_count, movie.imdb_us_rating,
                     movie.imdb_us_count, movie.imdb_not_us_rating, movie.imdb_not_us_count
                     ]
                )
        # places
        with open(os.path.join(FOLDER, './csv/place.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(['uuid', 'city', 'state', 'country'])
            for place in Place.objects.all():
                filewriter.writerow([place.uuid, place.city, place.state, place.country])
        # people (cast and crew)
        with open(os.path.join(FOLDER, './csv/person.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'tmdb_id', 'name', 'main_role', 'birth_place_id', 'image', 'number_of_movies', 'total_rating',
                 'credit_weighted_score']
            )
            for person in Person.objects.all():
                filewriter.writerow([person.uuid, person.tmdb_id, person.name, person.main_role,
                                     (person.birth_place_id if person.birth_place is not None else ''), person.image,
                                     person.number_of_movies_field, person.total_rating_field,
                                     person.credit_order_weighted_score_field])
        # watch info categories (genres and other types)
        with open(os.path.join(FOLDER, './csv/watch_info.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'name', 'type', 'movie_count', 'category_average', ]
            )
            for watch_info in WatchInfo.objects.all():
                filewriter.writerow(
                    [watch_info.uuid, watch_info.name, watch_info.watch_info_type, watch_info.movie_count,
                     watch_info.category_average])
        # individual viewings
        with open(os.path.join(FOLDER, './csv/viewing.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'movie_id', 'watch_date', 'watch_location_id', 'watch_device_id', 'watch_platform_id',
                 'viewing_people_ids', 'review', 'my_rating']
            )
            for viewing in Viewing.objects.all():
                filewriter.writerow([viewing.uuid, viewing.movie_id, viewing.watch_date,
                                     (viewing.watch_location_id if viewing.watch_location is not None else ''),
                                     (viewing.watch_device_id if viewing.watch_device is not None else ''),
                                     (viewing.watch_platform_id if viewing.watch_platform is not None else ''),
                                     [x.pk for x in viewing.people_with.all()], viewing.review, viewing.my_rating])
        # credits linking people to movies
        with open(os.path.join(FOLDER, './csv/credit.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'person_id', 'movie_id', 'department', 'character_job', 'order', 'order_score',
                 'order_weighted_score']
            )
            for credit in Credit.objects.all():
                filewriter.writerow(
                    [credit.uuid, credit.person_id, credit.movie_id, credit.department, credit.character_job,
                     credit.order, credit.order_score, credit.credit_order_weighted_score])
Code and data included throughout.